Abstract

What is happiness for reinforcement learning agents? We seek a formal
definition satisfying a list of desiderata. Our proposed definition of
happiness is the temporal difference error, i.e. the difference between the
value of the obtained reward and observation and the agent’s expectation
of this value. This definition satisfies most of our desiderata and is
compatible with empirical research on humans. We state several implications
and discuss examples.

Keywords. Temporal difference error, reward prediction error, pleasure,
well-being, optimism, machine ethics


1 Introduction 

People are constantly in search of better ways to be happy. However,
philosophers and psychologists have not yet agreed on a notion of human
happiness. In this paper, we pursue the more general goal of defining
happiness for intelligent agents. We focus on the reinforcement learning (RL)
setting [SB98] because it is an intensively studied formal framework, which
makes it easier to make precise statements. Moreover, reinforcement learning
has been used to model behaviour in both human and non-human animals [Niv09].

Here, we decouple the discussion of happiness from the discussion of
consciousness, experience, or qualia. We completely disregard whether
happiness is actually consciously experienced or what this means. The problem
of consciousness has to be solved separately; but its answer might matter
insofar as it could tell us which agents’ happiness we should care about.

Desiderata. We can simply ask a human how happy they are. But artificial 
reinforcement learning agents cannot yet speak. Therefore we use our human 
“common sense” intuitions about happiness to come up with a definition. We 
arrive at the following desired properties. 
• Scaling. Happiness should be invariant under scaling of the rewards.
Replacing every reward $r_t$ by $c r_t + d$ for some $c, d \in \mathbb{R}$
with $c > 0$ (independent of $t$) does not change the reinforcement learning
problem in any relevant way. Therefore we desire a happiness measure to be
invariant under rescaling of the rewards.

• Subjectivity. Happiness is a subjective property of the agent depending 
only on information available to the agent. For example, it cannot depend 
on the true environment. 

• Commensurability. The happiness of different agents should be comparable.
If at some time step an agent A has happiness $x$, and another agent
B has happiness $y$, then it should be possible to tell whether A is happier
than B by computing $x - y$. This could be relaxed by instead asking that
A can calculate the happiness of B according to A’s subjective beliefs.

• Agreement. The happiness function should match experimental data about 
human happiness. 

It has to be emphasised that in humans, happiness cannot be equated with
pleasure [RSDD14]. In the reinforcement learning setting, pleasure corresponds
to the reward. Therefore happiness and reward have to be distinguished. We
crudely summarise this as follows; for a more detailed discussion see
Section 3.

    pleasure = reward ≠ happiness

The happiness measure that we propose is the following. An agent’s happiness
in a time step $t$ is the difference between the value of the obtained reward
and observation and the agent’s expectation of this value at time step $t$. In
the Markov setting, this is also known as the temporal difference error (TD
error) [SB90]. However, we do not limit ourselves to the Markov setting in this
paper. In parts of the mammalian brain, the neuromodulator dopamine has a
strong connection to the TD error [Niv09]. Note that while our definition of
happiness is not equal to reward, it remains highly correlated with the reward,
especially if the expectation of the reward is close to 0.

Our definition of happiness coincides with the definition for joy given by
Jacobs et al. [JBJ14], except that the latter is weighted by 1 minus the
(objective) probability of taking the transition, which violates subjectivity.
Schmidhuber’s work on ‘intrinsic motivation’ adds a related component to the
reward in order to motivate the agent to explore in interesting directions
[Sch10].

Our definition of happiness can be split into two parts: (1) the difference
between the instantaneous reward and its expectation, which we call payout,
and (2) how the latest observation and reward change the agent’s estimate of
future rewards, which we call good news. Moreover, we identify two sources of
happiness: luck, favourable chance outcomes (e.g. rolling a six on a fair die),
and pessimism, having low expectations of the environment (e.g. expecting a
fair die to be biased against you). We show that agents that know the world
perfectly have zero expected happiness. Proofs can be found in Appendix A.

In the rest of the paper, we use our definition as a starting point to investigate 
the following questions. Is an off-policy agent happier than an on-policy one? 
Do monotonically increasing rewards necessarily imply a happy agent? How 
does value function initialisation affect the happiness of an agent? Can we 
construct an agent that maximises its own happiness? 
2 Reinforcement Learning

In reinforcement learning (RL) an agent interacts with an environment in cycles:
at time step $t$ the agent chooses an action $a_t \in \mathcal{A}$ and receives
an observation $o_t \in \mathcal{O}$ and a real-valued reward
$r_t \in \mathbb{R}$; the cycle then repeats for time step $t + 1$ [SB98]. The
list of interactions $a_1 o_1 r_1 a_2 o_2 r_2 \ldots$ is called a history. We
use $h_t$ to denote a history of length $t$, and we use the shorthand notation
$h := h_{t-1}$ and $h' := h_{t-1} a_t o_t r_t$. The agent’s goal is to choose
actions to maximise cumulative rewards. To avoid infinite sums, we use a
discount factor $\gamma$ with $0 < \gamma < 1$ and maximise the discounted sum
$\sum_{t=1}^{\infty} \gamma^{t-1} r_t$. A policy is a function $\pi$ mapping
every history to the action taken after seeing this history, and an
environment $\mu$ is a stochastic mapping from histories to observation-reward
tuples.

A policy $\pi$ together with an environment $\mu$ yields a probability
distribution over histories. Given a random variable $X$ over histories, we
write the $\pi$-$\mu$-expectation of $X$ conditional on the history $h$ as
$\mathbb{E}^\pi_\mu[X \mid h]$.

The (true) value function $V^\pi_\mu$ of a policy $\pi$ in environment $\mu$
maps a history $h_t$ to the expected total future reward when interacting with
environment $\mu$ and taking actions according to the policy $\pi$:

$$V^\pi_\mu(h_t) := \mathbb{E}^\pi_\mu\Big[\sum_{k=t+1}^{\infty} \gamma^{k-t-1} r_k \;\Big|\; h_t\Big]. \tag{1}$$

It is important to emphasise that $\mathbb{E}^\pi_\mu$ denotes the objective
expectation that can be calculated only by knowing the environment $\mu$. The
optimal value function $V^*_\mu$ is defined as the value function of the
optimal policy, $V^*_\mu(h) := \sup_\pi V^\pi_\mu(h)$.

Typically, reinforcement learners do not know the environment and are trying
to learn it. We model this by assuming that at every time step the agent
has (explicitly or implicitly) an estimate $\hat{V}$ of the value function
$V^\pi_\mu$. Formally, a value function estimator maps a history $h$ to a
value function estimate $\hat{V}$. Finally, we define an agent to be a policy
together with a value function estimator. If the history is clear from
context, we refer to the output of the value function estimator as the agent’s
estimated value.

If $\mu$ only depends on the last observation and action, $\mu$ is called a
Markov decision process (MDP). In this case,
$\mu(o_t r_t \mid h_{t-1} a_t) = \mu(o_t r_t \mid o_{t-1} a_t)$ and the
observations are called states ($s_t = o_t$). In MDPs we use the Q-value
function, the value of a state-action pair, defined as
$Q^\pi_\mu(s_t, a_t) := \mathbb{E}^\pi_\mu\big[\sum_{k=t+1}^{\infty} \gamma^{k-t-1} r_k \mid s_t a_t\big]$.

Assuming that the environment is an MDP is very common in the RL literature,
but here we will not make this assumption.


3 A Formal Definition of Happiness 


The goal of a reinforcement learning agent is to maximise rewards, so it seems
natural to suppose an agent is happier the more rewards it gets. But this does
not conform to our intuition: sometimes enjoying pleasures just fails to provide
happiness, and conversely, enduring suffering does not necessarily entail
unhappiness (see Example 3 and Example 7). In fact, it has been shown
empirically that rewards and happiness cannot be equated [RSDD14]
(p-value < 0.0001).

There is also a formal problem with defining happiness in terms of reward: we
can add a constant $c \in \mathbb{R}$ to every reward. No matter how the
agent-environment interaction plays out, the agent will have received
additional cumulative rewards $C := \sum_{i=1}^{t} c$. However, this does not
change the structure of the reinforcement learning problem in any way. Actions
that were optimal before are still optimal and actions that are slightly
suboptimal are still slightly suboptimal to the same degree. For the agent, no
essential difference between the original reinforcement learning problem and
the new problem can be detected: in a sense the two problems are isomorphic.
If we were to define an agent’s happiness as received reward, then an agent’s
happiness would vary wildly when we add a constant to the reward while the
problem stays structurally exactly the same.

We propose the following definition of happiness. 

Definition 1 (Happiness). The happiness of a reinforcement learning agent
with estimated value $\hat{V}$ at time step $t$ with history $h a_t$ while
receiving observation $o_t$ and reward $r_t$ is

$$☺(h a_t o_t r_t, \hat{V}) := r_t + \gamma \hat{V}(h a_t o_t r_t) - \hat{V}(h). \tag{2}$$

If $☺(h', \hat{V})$ is positive, we say the agent is happy, and if
$☺(h', \hat{V})$ is negative, we say the agent is unhappy.
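As a minimal sketch of Definition 1, the happiness of a single transition can be computed from three numbers: the reward, the estimated value of the old history, and the estimated value of the new history. The function name and the numbers below are illustrative assumptions, not part of the paper.

```python
def happiness(reward, value_next, value_prev, gamma):
    """Happiness of Definition 1: r_t + gamma * V(h') - V(h)."""
    return reward + gamma * value_next - value_prev


# Hypothetical numbers: the agent estimated a value of 1.0 for the old
# history, receives reward 0.5, and estimates 2.0 for the new history.
h = happiness(reward=0.5, value_next=2.0, value_prev=1.0, gamma=0.5)
print("happy" if h > 0 else "unhappy" if h < 0 else "neutral", h)
```

Only the agent's own estimates enter the computation, which is the sense in which the measure is subjective.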

It is important to emphasise that $\hat{V}$ represents the agent’s subjective
estimate of the value function. If the agent is good at learning, this might
converge to something close to the true value function $V^\pi_\mu$. In an MDP,
☺ is also known as the temporal difference error [SB90]. This number is used
to update the value function, and thus plays an integral part in learning.

If there exists a probability distribution $\rho$ on histories such that the
value function estimate $\hat{V}$ is given by the expected future discounted
rewards according to the probability distribution $\rho$,

$$\hat{V}(h_t) = \mathbb{E}^\pi_\rho\Big[\sum_{k=t+1}^{\infty} \gamma^{k-t-1} r_k \;\Big|\; h_t\Big], \tag{3}$$

then we call $\mathbb{E} := \mathbb{E}^\pi_\rho$ the agent’s subjective
expectation. Note that we can always find such a probability distribution, but
this notion only really makes sense for model-based agents (agents that learn
a model of their environment). Using the agent’s subjective expectation, we
can rewrite Definition 1 as follows.

Proposition 2 (Happiness as Subjective Expectation). Let $\mathbb{E}$ denote an
agent’s subjective expectation. Then

$$☺(h', \hat{V}) = r_t - \mathbb{E}[r_t \mid h] + \gamma\big(\hat{V}(h') - \mathbb{E}[\hat{V}(haor) \mid h]\big). \tag{4}$$

Proposition 2 states that happiness is given by the difference between how good
the agent thought it was doing and what it learns about how well it actually
does. We distinguish the following two components in ☺:

• Payout: the difference of the obtained reward $r_t$ and the agent’s
expectation of that reward $\mathbb{E}[r_t \mid h]$.

• Good News: the change in opinion of the expected future rewards after
receiving the new information $o_t r_t$.

$$☺(h', \hat{V}) = \underbrace{r_t - \mathbb{E}[r_t \mid h]}_{\text{payout}} + \gamma \underbrace{\big(\hat{V}(h') - \mathbb{E}[\hat{V}(haor) \mid h]\big)}_{\text{good news}}$$
Example 3. Mary is travelling on an airplane. She knows that airplanes crash
very rarely, and so is completely at ease. Unfortunately she is flying on a budget
airline, so she has to pay for her food and drink. A flight attendant comes to her
seat and gives her a free beverage. Just as she starts drinking it, the intercom
informs everyone that the engines have failed. Mary feels some happiness from
the free drink (payout), but her expected future reward is much lower than in
the state before learning the bad news. Thus overall, Mary is unhappy.

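For each of the two components, payout and good news, we distinguish the
following two sources of happiness.

• Pessimism¹: the agent expects the environment to contain fewer rewards
than it actually does.

• Luck: the outcome of $r_t$ is unusually high due to randomness.

$$r_t - \mathbb{E}[r_t \mid h] = \underbrace{r_t - \mathbb{E}^\pi_\mu[r_t \mid h]}_{\text{luck}} + \underbrace{\mathbb{E}^\pi_\mu[r_t \mid h] - \mathbb{E}[r_t \mid h]}_{\text{pessimism}}$$

$$\hat{V}(h') - \mathbb{E}[\hat{V}(haor) \mid h] = \underbrace{\hat{V}(h') - \mathbb{E}^\pi_\mu[\hat{V}(haor) \mid h]}_{\text{luck}} + \underbrace{\mathbb{E}^\pi_\mu[\hat{V}(haor) \mid h] - \mathbb{E}[\hat{V}(haor) \mid h]}_{\text{pessimism}}$$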
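The two decompositions above can be sketched for a single step with finitely many outcomes. The sketch assumes a model-based agent whose subjective belief is a probability vector `rho`, and an experimenter who also knows the true distribution `mu`; all names and the die example are illustrative assumptions.

```python
def expect(probs, values):
    """Expectation of `values` under the distribution `probs`."""
    return sum(p * x for p, x in zip(probs, values))


def decompose(outcome, rewards, v_next, rho, mu, gamma):
    """Split one-step happiness into luck and pessimism terms.

    rewards[i], v_next[i]: reward and estimated value after outcome i.
    rho[i], mu[i]: subjective and true probability of outcome i.
    """
    parts = {
        "payout_luck": rewards[outcome] - expect(mu, rewards),
        "payout_pessimism": expect(mu, rewards) - expect(rho, rewards),
        "news_luck": v_next[outcome] - expect(mu, v_next),
        "news_pessimism": expect(mu, v_next) - expect(rho, v_next),
    }
    payout = parts["payout_luck"] + parts["payout_pessimism"]
    good_news = parts["news_luck"] + parts["news_pessimism"]
    parts["happiness"] = payout + gamma * good_news
    return parts


# Hypothetical one-shot bet: the reward is the number shown by a fair die,
# but the agent wrongly believes the die favours low numbers.
rewards = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
v_next = [0.0] * 6                      # nothing happens after the roll
mu = [1 / 6] * 6                        # true distribution (fair die)
rho = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]    # pessimistic subjective belief
parts = decompose(5, rewards, v_next, rho=rho, mu=mu, gamma=0.5)  # a six
```

Note that the luck and pessimism terms need the true distribution and so are not available to the agent itself; only their sum, the happiness, is subjective.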


Example 4. Suppose Mary fears flying and expected the plane to crash
(pessimism). On hearing that the engines failed (bad luck), Mary does not
experience very much change in her future expected reward. Thus she is happy
that she (at least) got a free drink.

The following proposition states that once an agent has learned the
environment, its expected happiness is zero. In this case, underestimation
cannot contribute to happiness and thus the only source of happiness is luck,
which cancels out in expectation.

Proposition 5 (Happiness of Informed Agents). An agent that knows the world
has an expected happiness of zero: for every policy $\pi$ and every history $h$,

$$\mathbb{E}^\pi_\mu[☺(h', V^\pi_\mu) \mid h] = 0.$$

Analogously, if the environment is deterministic, then luck cannot be a source
of happiness. In this case, happiness reduces to how much the agent
underestimates the environment. By Proposition 5, having learned a
deterministic environment perfectly, the agent’s happiness is equal to zero.
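Proposition 5 can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-state Markov reward process (the policy is already fixed) with made-up transition probabilities and rewards; it computes the true value function by iterating the Bellman equation and then averages the happiness over the true transition probabilities.

```python
GAMMA = 0.9

# Hypothetical two-state Markov reward process (policy already fixed):
# P[s][s2] is the true transition probability, R[s][s2] the reward received.
P = [[0.7, 0.3], [0.4, 0.6]]
R = [[1.0, -2.0], [0.0, 5.0]]


def true_values(iterations=2000):
    """Iterate the Bellman equation until the true value function is reached."""
    v = [0.0, 0.0]
    for _ in range(iterations):
        v = [sum(P[s][s2] * (R[s][s2] + GAMMA * v[s2]) for s2 in range(2))
             for s in range(2)]
    return v


def happiness(v, s, s2):
    """TD error of the transition s -> s2 under value function v."""
    return R[s][s2] + GAMMA * v[s2] - v[s]


V = true_values()
expected = [sum(P[s][s2] * happiness(V, s, s2) for s2 in range(2))
            for s in range(2)]
print(expected)  # both entries are zero up to floating-point error
```

Individual transitions still carry non-zero happiness (luck); only the expectation vanishes.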


4 Matching the Desiderata 

Here we discuss in which sense our definition of happiness satisfies the desiderata
from Section 1.

¹ Optimism is a standard term in the RL literature to denote the opposite phenomenon.
However, this notion is somewhat in discord with optimism in humans.
Scaling. If we transform the rewards to $r'_t = c r_t + d$ with $c > 0$,
$d \in \mathbb{R}$ for each time step $t$ without changing the value function,
the value of ☺ will be completely different. However, a sensible learning
algorithm should be able to adapt to the new reinforcement learning problem
with the scaled rewards without too much difficulty. At that point, the value
function gets scaled as well,
$\hat{V}_{\text{new}}(h) = c\hat{V}(h) + d/(1 - \gamma)$. In this case we get

$$\begin{aligned}
☺(h a_t o_t r'_t, \hat{V}_{\text{new}}) &= r'_t + \gamma \hat{V}_{\text{new}}(h a_t o_t r'_t) - \hat{V}_{\text{new}}(h) \\
&= c r_t + d + \gamma c \hat{V}(h a_t o_t r_t) + \gamma \frac{d}{1-\gamma} - c\hat{V}(h) - \frac{d}{1-\gamma} \\
&= c\big(r_t + \gamma \hat{V}(h a_t o_t r_t) - \hat{V}(h)\big),
\end{aligned}$$

hence happiness gets scaled by a positive factor and thus its sign remains the
same, which would not hold if we defined happiness just in terms of rewards.
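The scaling argument can be verified with concrete numbers. In this sketch the reward, the two value estimates and the constants `c` and `d` are arbitrary illustrative choices; the rescaled value estimates follow the transformation stated above.

```python
def happiness(reward, value_next, value_prev, gamma):
    """TD error r_t + gamma * V(h') - V(h)."""
    return reward + gamma * value_next - value_prev


gamma, c, d = 0.9, 3.0, -2.0           # hypothetical discount and rescaling
r, v_next, v_prev = 1.0, 4.0, 5.0      # hypothetical reward and estimates

shift = d / (1 - gamma)                # additive shift of the value function
original = happiness(r, v_next, v_prev, gamma)
rescaled = happiness(c * r + d, c * v_next + shift, c * v_prev + shift, gamma)

# The additive terms cancel, leaving exactly the factor c.
print(original, rescaled, c * original)
```

Because `c` is positive, the sign of the happiness is unchanged, which is the invariance the Scaling desideratum asks for in its weak form.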

Subjectivity. The definition (2) of ☺ depends only on the current reward and
the agent’s current estimation of the value function, both of which are available
to the agent.

Commensurability. The scaling property as described above means that the
exact value of the happiness is not useful in comparing two agents, but the sign
of the total happiness can at least tell us whether a given agent is happy or
unhappy. Arguably, failing this desideratum is not surprising; in utility theory
the utilities/rewards of different agents are typically not commensurable either.

However, given two agents A and B, A can still calculate the A-subjective
happiness of a history experienced by B as $☺(haor^B, \hat{V}^A)$. This
corresponds to the human intuition of “putting yourself in someone else’s
shoes”. If both agents are acting in the same environment, the resulting
numbers should be commensurable, since the calculation is done using the same
value function. It is entirely possible that A believes B to be happier, i.e.
$☺(haor^B, \hat{V}^A) > ☺(haor^A, \hat{V}^A)$, but also that B believes A to
be happier, i.e. $☺(haor^A, \hat{V}^B) > ☺(haor^B, \hat{V}^B)$, because they
have different expectations of the environment.


Agreement. Rutledge et al. measure subjective well-being in a smartphone-based
experiment with 18,420 participants [RSDD14]. In the experiment, a
subject goes through 30 trials in each of which they can choose between a sure
reward and a gamble that is resolved within a short delay. Every two to three
trials the subjects are asked to indicate their momentary happiness.

Our model based on Proposition 2 with a very simple learning algorithm
and no loss aversion correlates fairly well with reported happiness (mean
$r = 0.56$, median $r^2 = 0.41$, median $R^2 = 0.27$) while fitting individual
discount factors, comparable to Rutledge et al.’s model (mean $r = 0.60$,
median $r^2 = 0.47$, median $R^2 = 0.36$) and a happiness=cumulative reward
model (mean $r = 0.59$, median $r^2 = 0.46$, median $R^2 = 0.35$). This
analysis is inconclusive, but unsurprisingly so: the expected reward is close
to 0 and thus our happiness model correlates well with rewards. See Appendix B
for the details of our data analysis.

The hedonic treadmill [BC71] refers to the idea that humans return to a
baseline level of happiness after significant negative or positive events.
Studies have looked at lottery winners and accident victims [BCJB78], and
people dealing with paralysis, marriage, divorce, having children and other
life changes [DLS06]. In most cases these studies have observed a return to
baseline happiness after some period of time has passed; people learn to make
correct reward predictions again. Hence their expected happiness returns to zero
(Proposition 5). Our definition unfortunately does not explain why people have
different baseline levels of happiness (or hedonic set points), but these may
perhaps be explained by biological means (different humans have different levels
of neuromodulators, neurotransmitters, hormones, etc.) which may move their
baseline happiness. Alternatively, people might simply learn to associate
different levels of happiness with “feeling happy” according to their
environment.


5 Discussion and Examples 

5.1 Off-policy Agents 

In reinforcement learning, we are mostly interested in learning the value function
of the optimal policy. A common difference between RL algorithms is whether
they learn off-policy or on-policy. An on-policy agent evaluates the value of the
policy it is currently following. For example, the policy that the agent is made
to follow could be an $\varepsilon$-greedy policy, where the agent picks
$\arg\max_a Q(s, a)$ a fraction $(1 - \varepsilon)$ of the time, and a random
action otherwise. If $\varepsilon$ is decreased to zero over time, then the
agent’s learned policy tends to the optimal policy in MDPs. Alternatively, an
agent can learn off-policy, that is, it can learn about one policy (say, the
optimal one) while following a different behaviour policy.
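For concreteness, the ε-greedy rule described above can be sketched as follows. The representation of the Q-values as a dictionary from actions to numbers and the function name are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: dict mapping each action to its estimated Q-value in the
    current state (a hypothetical representation).
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


# With epsilon = 0 the rule is purely greedy.
action = epsilon_greedy({"alpha": 1.0, "beta": 2.0}, epsilon=0.0)
print(action)
```

An on-policy learner would evaluate the policy that includes these random actions, whereas an off-policy learner would evaluate the greedy policy while still acting ε-greedily.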

The behaviour policy ($\pi_b$) determines how the agent acts while it is learning
the optimal policy. Once an off-policy learning agent has learned the optimal
value function $V^*_\mu$, then it is not happy if it still acts according to
some other (possibly suboptimal) policy.

Proposition 6 (Happiness of Off-Policy Learning). Let $\pi$ be some policy and
$\mu$ be some environment. Then for any history $h$

$$\mathbb{E}^\pi_\mu[☺(h', V^*_\mu) \mid h] \le 0.$$

Q-learning is an example of an off-policy algorithm in the MDP setting. If
Q-learning converges, and the agent is still following the sub-optimal behaviour
policy, then Proposition 6 tells us that the agent will be unhappy. Moreover, this
means that SARSA (an on-policy RL algorithm) will be happier than Q-learning
on average and in expectation.
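Proposition 6 can be illustrated on a small example. The sketch below assumes a hypothetical two-state MDP with made-up transitions; it computes the optimal value function by value iteration and then evaluates the expected happiness of each action when happiness is measured against the optimal values.

```python
GAMMA = 0.9

# Hypothetical MDP: MDP[state][action] is a list of
# (probability, next_state, reward) triples.
MDP = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.5, 1, 1.0), (0.5, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}


def q_value(v, s, a):
    """One-step lookahead value of action a in state s."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in MDP[s][a])


def optimal_values(iterations=2000):
    """Value iteration for the optimal value function."""
    v = {s: 0.0 for s in MDP}
    for _ in range(iterations):
        v = {s: max(q_value(v, s, a) for a in MDP[s]) for s in MDP}
    return v


def expected_happiness(v, s, a):
    """Expected TD error w.r.t. v when taking action a in state s."""
    return sum(p * (r + GAMMA * v[s2] - v[s]) for p, s2, r in MDP[s][a])


V_STAR = optimal_values()
table = {(s, a): expected_happiness(V_STAR, s, a) for s in MDP for a in MDP[s]}
```

In this example the entries for optimal actions are zero up to rounding, while every suboptimal action has strictly negative expected happiness.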

5.2 Increasing and Decreasing Rewards 

Intuitively, it seems that if things are constantly getting better, this should 
increase happiness. However, this is not generally the case: even an agent that 
obtains monotonically increasing rewards can be unhappy if it thinks that these 
rewards mean even higher negative rewards in the future. 

Example 7. Alice has signed up for a questionable drug trial which examines
the effects of a potentially harmful drug. This drug causes temporary pleasure to
the user every time it is used, and increased usage results in increased pleasure.
However, the drug reduces quality of life in the long term. Alice has been
informed of the potential side-effects of the drug. She can be either part of a
placebo group or the group given the drug. Every morning Alice is given an
injection of an unknown liquid. She finds herself feeling temporary but intense
feelings of pleasure. This is evidence that she is in the non-placebo group, and
thus has a potentially reduced quality of life in the long term. Even though she
experiences pleasure (increasing rewards), it is evidence of very bad news and
thus she is unhappy.

Figure 1: MDP of Example 8 with transitions labelled with actions $\alpha$ or
$\beta$ and rewards. We use the discount factor $\gamma = 0.5$. The agent
starts in $s_0$. Define $\pi_0(s_0) := \alpha$, then $V^{\pi_0}(s_0) = 0$.
The optimal policy is $\pi^*(s_0) = \beta$, so $V^{\pi^*}(s_0) = 1$ and
$V^{\pi^*}(s_1) = 4$.

Figure 2: A plot of happiness for Example 8. We use the learning rate
$\alpha = 0.1$. The pessimistic agent has zero happiness (and rewards),
whereas the optimistic agent is initially unhappy, but once it transitions
to state $s_1$ becomes happy. The plot also shows the rewards of the
optimistic agent.


Analogously, decreasing rewards do not generally imply unhappiness. For
example, the pains of hard labour can mean happiness if one expects to harvest
the fruits of this labour in the future.


5.3 Value Function Initialisation 


Example 8 (Increasing Pessimism Does Not Increase Happiness). Consider the
deterministic MDP example in Figure 1. Assume that the agent has an initial
value function $Q_0(s_0, \alpha) = 0$, $Q_0(s_0, \beta) = -\varepsilon$,
$Q_0(s_1, \alpha) = \varepsilon$ and $Q_0(s_1, \beta) = 0$. If no forced
exploration is carried out by the agent, it has no incentive to visit $s_1$.
The happiness achieved by such an agent for some time step $t$ is
$☺(s_0 \alpha s_0 0, \hat{V}_0) = 0$ where
$\hat{V}_0(s_0) := Q_0(s_0, \alpha) = 0$. However, suppose the agent is
(more optimistically) initialised with $Q_0(s_0, \alpha) = 0$,
$Q_0(s_0, \beta) = \varepsilon$. In this case, the agent would take action
$\beta$ and arrive in state $s_1$. This transition would have happiness
$☺(s_0 \beta s_1 {-1}, \hat{V}_0) = -1 + \gamma Q_0(s_1, \alpha) - Q_0(s_0, \beta) = -1 - 0.5\varepsilon$.
However, the next transition is $s_1 \alpha s_1 2$ which has happiness
$☺(s_1 \alpha s_1 2, \hat{V}_0) = 2 + \gamma Q_0(s_1, \alpha) - Q_0(s_1, \alpha) = 2 - 0.5\varepsilon$.
If $Q_0$ is not updated by some learning mechanism the agent will continue to
accrue this positive happiness for all future time steps. If the agent does
learn, it will still be some time steps before $Q$ converges to $Q^*$ and the
positive happiness becomes zero (see Figure 2). It is arguable whether this
agent, which suffered one time step of unhappiness but potentially many time
steps of happiness, is overall a happier agent, but it is some evidence that
absolute pessimism does not necessarily lead to the happiest agents.
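The two happiness values in Example 8 can be reproduced with a few lines of arithmetic. The sketch below fixes ε = 0.1 as an arbitrary illustrative value and then, as a further assumption about the learning mechanism, applies a simple TD update with learning rate 0.1 to the repeated transition in $s_1$ to show the happiness decaying towards zero.

```python
GAMMA = 0.5    # discount factor of Figure 1
ALPHA = 0.1    # learning rate (assumed for the update rule below)
EPS = 0.1      # hypothetical value for the small constant epsilon


def happiness(reward, q_next, q_prev):
    """TD error with the value of a state given by the Q-value of the action taken."""
    return reward + GAMMA * q_next - q_prev


# Optimistic initialisation from Example 8.
q = {("s0", "alpha"): 0.0, ("s0", "beta"): EPS,
     ("s1", "alpha"): EPS, ("s1", "beta"): 0.0}

first = happiness(-1.0, q[("s1", "alpha")], q[("s0", "beta")])    # -1 - 0.5*EPS
second = happiness(2.0, q[("s1", "alpha")], q[("s1", "alpha")])   #  2 - 0.5*EPS

# Repeating the transition s1 -alpha-> s1 (reward 2) while learning.
trace = []
for _ in range(200):
    delta = happiness(2.0, q[("s1", "alpha")], q[("s1", "alpha")])
    trace.append(delta)
    q[("s1", "alpha")] += ALPHA * delta
```

Under these assumptions each update shrinks the happiness by a constant factor, and the Q-value of $(s_1, \alpha)$ approaches 4, the optimal value of $s_1$ stated in Figure 1.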

5.4 Maximising Happiness 

How can an agent increase its own happiness? The first source of happiness,
luck, depends entirely on the outcome of a random event that the agent
has no control over. However, the agent could modify its learning algorithm
to be systematically pessimistic about the environment. For example, when
fixing the value function estimation below $r_{\min}/(1 - \gamma)$ for all
histories, happiness is positive at every time step. But this agent would not
actually take any sensible actions. Just as optimism is commonly used to
artificially increase exploration, pessimism discourages exploration, which
leads to poor performance. As demonstrated in Example 8, a pessimistic agent
may be less happy than a more optimistic one.
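The positivity claim can be checked in one line under the additional assumption (ours, for illustration) that the estimate is fixed at a single constant $v$ for all histories, where $r_{\min}$ denotes a lower bound on the rewards and $0 < \gamma < 1$:

```latex
\hat{V}(h) = v < \frac{r_{\min}}{1-\gamma} \ \text{for all } h
\quad\Longrightarrow\quad
r_t + \gamma v - v \;=\; r_t - (1-\gamma)\,v \;>\; r_t - r_{\min} \;\ge\; 0 .
```

If the estimate were allowed to differ between histories while staying below the bound, the term $\gamma \hat{V}(h') - \hat{V}(h)$ need not have a fixed sign, so the constant case is the one this sketch covers.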

Additionally, an agent that explicitly tries to maximise its own happiness 
is no longer a reinforcement learner. So instead of asking how an agent can 
increase its own happiness, we should fix a reinforcement learning algorithm 
and ask for the environment that would make this algorithm happy. 


6 Conclusion 

An artificial superintelligence might contain subroutines that are capable of
suffering, a phenomenon that Bostrom calls mind crime [Bos14, Ch. 8]. More
generally, Tomasik argues that even current reinforcement learning agents could
have moral weight [Tom14]. If this is the case, then a general theory of
happiness for reinforcement learners is essential; it would enable us to derive
ethical standards in the treatment of algorithms. Our theory is very
preliminary and should be thought of as a small step in this direction. Many
questions are left unanswered, and we hope to see more research on the
suffering of AI agents in the future.

Acknowledgements. We thank Marcus Hutter and Brian Tomasik for careful
reading and detailed feedback. The data from the smartphone experiment
was kindly provided by Robb Rutledge. We are also grateful to many of our 
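A Omitted Proofs

Proof of Proposition 2.

$$\begin{aligned}
☺(h', \hat{V}) &= r_t + \gamma \hat{V}(h') - \hat{V}(h) \\
&= r_t + \gamma \hat{V}(h') - \mathbb{E}\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \;\Big|\; h\Big] \\
&= r_t - \mathbb{E}[r_t \mid h] + \gamma \hat{V}(h') - \mathbb{E}\Big[\sum_{k=t+1}^{\infty} \gamma^{k-t} r_k \;\Big|\; h\Big] \\
&= r_t - \mathbb{E}[r_t \mid h] + \gamma \hat{V}(h') - \gamma\, \mathbb{E}\Big[\mathbb{E}\Big[\sum_{k=t+1}^{\infty} \gamma^{k-t-1} r_k \;\Big|\; haor\Big] \;\Big|\; h\Big] \\
&= r_t - \mathbb{E}[r_t \mid h] + \gamma \hat{V}(h') - \gamma\, \mathbb{E}\big[\hat{V}(haor) \mid h\big],
\end{aligned}$$

where the second to last equality uses the tower property for conditional
expectations. □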

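Proof of Proposition 5. For the true value function $V^\pi_\mu$, a subjective
expectation exists by definition since (3) is the same as (1) when
$\rho = \mu$. Hence we can apply Proposition 2:

$$\begin{aligned}
\mathbb{E}^\pi_\mu[☺(h', V^\pi_\mu) \mid h]
&= \mathbb{E}^\pi_\mu\big[r_t - \mathbb{E}^\pi_\mu[r_t \mid h] + \gamma V^\pi_\mu(h') - \gamma \mathbb{E}^\pi_\mu[V^\pi_\mu(haor) \mid h] \;\big|\; h\big] \\
&= \mathbb{E}^\pi_\mu[r_t \mid h] - \mathbb{E}^\pi_\mu\big[\mathbb{E}^\pi_\mu[r_t \mid h] \mid h\big] + \gamma \mathbb{E}^\pi_\mu[V^\pi_\mu(h') \mid h] - \gamma \mathbb{E}^\pi_\mu\big[\mathbb{E}^\pi_\mu[V^\pi_\mu(haor) \mid h] \mid h\big] \\
&= \mathbb{E}^\pi_\mu[r_t \mid h] - \mathbb{E}^\pi_\mu[r_t \mid h] + \gamma \mathbb{E}^\pi_\mu[V^\pi_\mu(h') \mid h] - \gamma \mathbb{E}^\pi_\mu[V^\pi_\mu(haor) \mid h] = 0. \qquad □
\end{aligned}$$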

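Proof of Proposition 6. Let $h'$ be any history of length $t$, and let $\pi^*$
denote an optimal policy for environment $\mu$, i.e., $V^*_\mu = V^{\pi^*}_\mu$.
In this case, we have $V^*_\mu(h) \ge V^*_\mu(h\pi(h))$, and hence

$$\mathbb{E}^{\pi^*}_\mu\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \;\Big|\; h\Big] \;\ge\; \mathbb{E}^{\pi^*}_\mu\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \;\Big|\; h\pi(h)\Big]. \tag{5}$$

We use this in the following:

$$\begin{aligned}
\mathbb{E}^\pi_\mu[☺(h', V^*_\mu) \mid h]
&= \mathbb{E}^\pi_\mu\big[r_t + \gamma V^*_\mu(h') - V^*_\mu(h) \;\big|\; h\big] \\
&= \mathbb{E}^\pi_\mu\Big[r_t + \gamma\, \mathbb{E}^{\pi^*}_\mu\Big[\sum_{k=t+1}^{\infty} \gamma^{k-t-1} r_k \;\Big|\; h'\Big] - \mathbb{E}^{\pi^*}_\mu\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \;\Big|\; h\Big] \;\Big|\; h\Big] \\
&\le \mathbb{E}^\pi_\mu\Big[r_t + \mathbb{E}^{\pi^*}_\mu\Big[\sum_{k=t+1}^{\infty} \gamma^{k-t} r_k \;\Big|\; h'\Big] - \mathbb{E}^{\pi^*}_\mu\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \;\Big|\; h a_t\Big] \;\Big|\; h\Big] \\
&= \underbrace{\mathbb{E}^\pi_\mu\big[r_t - \mathbb{E}^{\pi^*}_\mu[r_t \mid h a_t] \;\big|\; h\big]}_{=0}
 + \underbrace{\mathbb{E}^\pi_\mu\Big[\mathbb{E}^{\pi^*}_\mu\Big[\sum_{k=t+1}^{\infty} \gamma^{k-t} r_k \;\Big|\; h'\Big] - \mathbb{E}^{\pi^*}_\mu\Big[\sum_{k=t+1}^{\infty} \gamma^{k-t} r_k \;\Big|\; h a_t\Big] \;\Big|\; h\Big]}_{=0}
\end{aligned}$$

since $\mathbb{E}^\pi_\mu\big[\mathbb{E}^{\pi^*}_\mu[X \mid h a_t o_t r_t] \mid h\big] = \mathbb{E}^\pi_\mu\big[\mathbb{E}^{\pi^*}_\mu[X \mid h a_t] \mid h\big]$
because conditional on $h a_t$, the distribution of $o_t$ and $r_t$ is
independent of the policy. □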


B Data Analysis 

In this section we describe the details of the data analysis on the Great Brain
Experiment² conducted by Rutledge et al. [RSDD14]. This experiment measures
subjective well-being on a smartphone app and had 18,420 subjects. Each
subject goes through 30 trials and starts with 500 points. At the start of a trial,
the subject is given an option to choose between a certain reward (CR) and a
50-50 gamble between two outcomes. The gamble is pictorially shown to have
an equal probability between the outcomes, thus making it easy for the subject
to choose between the certain reward and the gamble.

Before the trials start the subject is asked to rate their current happiness on
a scale of 0-100. The slider for the scale is initialised randomly. Every two to
three trials, 12 times in each play, the subject is asked the question “How
happy are you at this moment?”

We model this experiment as a reinforcement learning problem with 150 
different states representing the different trials that the subject can see (the 
possible combinations of certain rewards and gamble outcomes), and two actions 
certain and gamble. For example, in a trial the subject could be presented 
with the choice between a certain reward of 30 and a 50-50 gamble between 0 
and 72. 

The expected reward after each state is uniquely determined by the state’s 
description, but the agent has subjective uncertainty about the value of the 
state, since it does not know which states (types of trials) will follow the current 
one or how these states are distributed. (In the experiment, the a priori expected 
outcome of a trial (the average value of the states) was 5.5 points, and the 
maximum gain or loss in a single trial is 220 points.) Furthermore, the subject 
might incorrectly estimate the value of the gamble: although the visualisation 
correctly suggests that each of the two outcomes is equally likely, the subject 
might be uncertain about this, or simply compute the average incorrectly. 

Rutledge et al. model happiness as an affine-linear combination of the certain
reward (CR), the expected value of the gamble (EV) and the reward prediction
error at the outcome of the gamble (RPE). The weights for this affine-linear
combination were learned through linear regression on fMRI data from another
similar experiment on humans ($w_{CR} = 0.52$, $w_{EV} = 0.35$, and
$w_{RPE} = 0.8$ for z-scored happiness ratings).

The data was kindly provided to us by Rutledge. We disregard the first 
happiness rating that occurs before the first trial. Moreover, we removed all 762 
subjects whose happiness ratings (other than the first) had a standard deviation 
of 0. 

² http://www.thegreatbrainexperiment.com/

Figure 3: Distribution of the discount factor $\gamma$ (for happiness
aggregation and rewards) fit per subject ($N = 17658$). The plot for Rutledge
et al.’s model looks similar.


To test our happiness definition on the data, we have to model how humans
estimate the value of a history. We chose a very simple model where a subject’s
expectation before each trial is the average of the previous outcomes. We use the
empirical distribution function, which estimates all future rewards as the average
outcome so far:

$$\rho(r_n \mid h) := \frac{\#\{k \le t \mid r_k = r_n\}}{t} \quad \text{for } n > t.$$

With Equation 3 this gives a value function estimate of

$$\hat{V}(a_1 o_1 r_1 \ldots a_t o_t r_t) = \frac{1}{t(1-\gamma)} \sum_{i=1}^{t} r_i.$$

We assume that the subject typically computes the expected value of the
trial correctly, i.e. $\mathbb{E}[r_t \mid h] := \max\{CR, EV\}$. We calculate
happiness at time $t$ with the formula from Proposition 2.
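The value estimate of this simple subject model is just the average outcome so far, discounted over an infinite future. A minimal sketch, with a function name and sample outcomes chosen for illustration:

```python
def value_estimate(outcomes, gamma):
    """Average outcome so far, treated as the reward of every future trial.

    Implements (1 / (t * (1 - gamma))) * sum of the t observed rewards.
    """
    t = len(outcomes)
    return sum(outcomes) / (t * (1 - gamma))


# Hypothetical outcomes of three trials.
v = value_estimate([10.0, -4.0, 6.0], gamma=0.5)
print(v)
```

The estimate is undefined before the first trial (t = 0) and for a discount factor of exactly 1, so a fitting procedure built on it would need to handle these edge cases separately.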

Since subjects’ happiness ratings are not given on every trial, we use
geometric discounting with discount factor $\gamma$ to aggregate happiness from
previous trials:

$$\text{predicted happiness} := \sum_{k=1}^{t} \gamma^{t-k}\, ☺(h_k, \hat{V}_k),$$

where $h_k$ is the history at time step $k$ and $\hat{V}_k$ is the value
function estimate at time step $k$. For each subject, we optimise $\gamma$ such
that the Pearson correlation $r$ is maximised. The result is that the majority
of subjects use a $\gamma$ of either very close to 0 or very close to 1 (see
Figure 3). A possible explanation of this phenomenon could be that it is
unclear to the human subjects whether they are to indicate their instantaneous
or their cumulative happiness.
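The geometric aggregation of per-trial happiness can be sketched with a running total. The function name and the sample values are illustrative assumptions; the two limiting cases correspond to the instantaneous and cumulative readings mentioned above.

```python
def predicted_happiness(per_trial, gamma):
    """Sum over k of gamma**(t - k) * happiness_k for trials k = 1..t."""
    total = 0.0
    for h in per_trial:
        # Everything accumulated so far is discounted once more.
        total = gamma * total + h
    return total


trials = [1.0, 2.0, 3.0]                        # hypothetical happiness values
print(predicted_happiness(trials, gamma=0.5))   # 0.25*1 + 0.5*2 + 3
print(predicted_happiness(trials, gamma=0.0))   # only the latest trial counts
print(predicted_happiness(trials, gamma=1.0))   # plain cumulative sum
```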

Figure 4: The distribution of the correlation coefficients for our happiness
model ($N = 17658$).

Figure 5: The distribution of the correlation coefficients for Rutledge
et al.’s model ($N = 17658$).

We use the sample Pearson correlation $r$, its square $r^2$ and the coefficient
of determination $R^2$ to evaluate each model. For $n$ data points (in our case
$n = 11$, the number of happiness ratings per subject), the sample Pearson
correlation $r$ is defined as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean. The
coefficient of determination on $n$ data points with (training) data
$x \in X$ and predictions $y \in Y$ is

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

We calculate $R^2$ only on z-scored terms; z-scoring does not affect $r$.
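The two evaluation measures can be sketched with the standard textbook formulas. The function names and the sample ratings are illustrative assumptions; the z-scoring helper uses the population standard deviation, which is one of several reasonable conventions.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def r_squared(data, predictions):
    """Coefficient of determination: 1 - residual SS / total SS."""
    m = mean(data)
    ss_res = sum((x - y) ** 2 for x, y in zip(data, predictions))
    ss_tot = sum((x - m) ** 2 for x in data)
    return 1 - ss_res / ss_tot


def zscore(xs):
    """Shift to mean 0 and scale to unit (population) standard deviation."""
    m = mean(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]


ratings = [50.0, 60.0, 55.0, 70.0]    # hypothetical reported happiness
predicted = [1.0, 3.0, 2.0, 5.0]      # hypothetical model output
r_raw = pearson_r(ratings, predicted)
r_z = pearson_r(zscore(ratings), zscore(predicted))
fit = r_squared(zscore(ratings), zscore(predicted))
```

All three functions divide by a spread term and so fail on constant sequences, which is consistent with excluding subjects whose ratings have a standard deviation of 0.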

Our model’s predicted happiness correlates fairly well with reported happiness
(mean $r = 0.56$, median $r^2 = 0.41$, median $R^2 = 0.27$) while fitting
individual discount factors, comparable to Rutledge et al.’s model (mean
$r = 0.60$, median $r^2 = 0.47$, median $R^2 = 0.36$) and a
happiness=cumulative reward model (mean $r = 0.59$, median $r^2 = 0.46$,
median $R^2 = 0.35$). So this analysis is inconclusive. Figure 4 and Figure 5
show the distribution of the correlation coefficients. We emphasise that our
model was derived a priori and then tested on the data, while their model has
three parameters (weights for CR, EV, and RPE) that were fitted on data from
humans using linear regression.

It is well known that humans do not value rewards linearly. In particular,
people appear to weigh losses more heavily than gains, an effect known as loss
aversion. To model this, we set

$$\text{reward} := \begin{cases} \text{outcome}^{\alpha} & \text{if outcome} > 0 \\ -\lambda \cdot (-\text{outcome})^{\beta} & \text{otherwise.} \end{cases}$$

Using parameters found by Rutledge et al. (mean $\lambda = 1.7$, mean
$\alpha = 1.05$, mean $\beta = 1.01$) we also get a slightly improved agreement
(mean $r = 0.60$, median $r^2 = 0.48$, median $R^2 = 0.39$). Table 1 lists
these results.
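The loss-aversion transformation can be sketched directly from the case distinction above. Using the reported mean parameter values as defaults is an assumption made for illustration; the function name is hypothetical.

```python
def subjective_reward(outcome, lam=1.7, alpha=1.05, beta=1.01):
    """Map a raw outcome to a subjective reward with loss aversion.

    Gains are raised to the power alpha; losses are raised to the power
    beta and weighted by the loss-aversion factor lam.
    """
    if outcome > 0:
        return outcome ** alpha
    return -lam * (-outcome) ** beta


gain = subjective_reward(10.0)
loss = subjective_reward(-10.0)
print(gain, loss)  # the loss weighs more heavily than the equal-sized gain
```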
Model            $\gamma$      $r$           median $r^2$   $R^2$         median $R^2$
normal           0.52 ± 0.34   0.56 ± 0.30   0.41           0.12 ± 0.61   0.27
loss aversion    0.53 ± 0.34   0.60 ± 0.31   0.48           0.20 ± 0.61   0.39
Rutledge et al.  0.55 ± 0.35   0.60 ± 0.30   0.47           0.19 ± 0.60   0.36
pleasure         0.55 ± 0.35   0.59 ± 0.30   0.46           0.18 ± 0.60   0.35

Table 1: Discount factor $\gamma$, correlation coefficient $r$, and
goodness-of-fit measure $R^2$ for various versions of our model, compared to
Rutledge et al. and a happiness=pleasure model ($N = 17658$).